The Multi-Lingual Sentiment Analysis Pipeline project delivers an end-to-end NLP solution for accurate sentiment detection (positive/negative/neutral) across 12+, handling informal text, emojis, slang, and code-mixing. It fine-tunes XLM-RoBERTa on diverse multilingual datasets, includes advanced preprocessing modules, compares deep learning vs. rule-based baselines, and provides an interactive Streamlit dashboard for real-time and batch analysis. The system achieves 89% average accuracy/F1, outperforms rule-based by 25% on complex inputs, reduces analysis time by 70%, and was completed over 8.5 months from March to November 2025 for global social media and customer feedback applications.
The architecture follows a modular pipeline: input text undergoes preprocessing (emoji conversion, slang normalization, code-mix detection), feeds into fine-tuned XLM-RoBERTa for inference, optionally routes to rule-based baseline for comparison, and outputs results with confidence scores via Streamlit dashboard. This design ensures language-agnostic handling, real-time visualization (pie charts, distributions), batch processing, and easy deployment, focusing on 12 languages with transfer learning for low-resource ones.
The system uses HuggingFace Transformers for XLM-RoBERTa fine-tuning and inference, Python for scripting and preprocessing (emoji, re, langdetect, NLTK), and Streamlit for the interactive web dashboard. Additional libraries include torch for training, datasets from HuggingFace Hub, and VADER/adapted lexicons for rule-based comparison; supports GPU acceleration and caching for efficiency.
The core model fine-tunes XLM-RoBERTa-base with a classification head (cross-entropy loss, 2e-5 LR, 5 epochs) on multilingual datasets (ML-SENT, Twitter/reviews). Features include emoji-to-text (demojize), slang dictionaries/regex per language, code-mix segmentation with langdetect and per-segment stops. Rule-based baseline uses adapted VADER/lexicons for comparison, with DL achieving 89% F1 vs. rule-based 60-65% on code-mixed/nuanced text, plus confidence scores and visualizations.
Data processing curates multilingual corpora from HuggingFace Datasets/Twitter (12 languages), preprocesses with emoji handling, slang normalization (dicts/regex), code-mix detection (langdetect + segmentation), stopword removal, and augmentation for low-resource languages. Fine-tuning uses tokenized inputs (max 512), inference applies the same pipeline, with caching for dashboard speed; handles 100+ queries/min and batch uploads.
Testing includes unit for preprocessing modules, integration for pipeline flow, accuracy/F1 on held-out multilingual benchmarks (>85%), and usability for dashboard (batch/real-time). Deployment hosts Streamlit on Sharing/Heroku with caching/async, uses phased rollout with GPU options, and supports rollback via model versioning if issues arise.
Post-deployment, monitor inference latency/accuracy via Streamlit logs, periodic re-fine-tuning on new data, and dashboard usage, aiming for >99% uptime and <2s responses. Maintenance includes quarterly updates for slang dictionaries/languages, monthly performance audits, and cost controls (caching, CPU fallback), with alerts for low-confidence predictions.